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Abstract 

In this paper, wc consider the problem of minimizing the sum of two convex functions 
subject to linear linking constraints. The classical alternating direction type methods usually 
assume that the two convex functions have relatively easy proximal mappings. However, many 
problems arising from statistics, image processing and other fields have the structure that only 
one of the two functions has easy proximal mapping, and the other one is smoothly convex but 
does not have an easy proximal mapping. Therefore, the classical alternating direction methods 
cannot be applied. For solving this kind of problems, we propose in this paper an alternating 
direction method based on extragradients. Under the assumption that the smooth function has a 
Lipschitz continuous gradient, we prove that the proposed method returns an e-optimal solution 
within 0(1 /e) iterations. We test the performance of different variants of the proposed method 
through solving the basis pursuit problem arising from compressed sensing. We then apply the 
proposed method to solve a new statistical model called fused logistic regression. Our numerical 
experiments show that the proposed method performs very well when solving the test problems. 
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1 Introduction 



In this paper, we consider solving the following convex optimization problem: 

mm xm ™, v mp f{x)+g{y) 

s.t. Ax + By = b (1.1) 

x € X,y G y, 

where / and g are convex functions, A € R mxn , 1? € ]R mxp , b € M m , A? and y are convex sets and 
the projections on them can be easily obtained. Problems in the form of (1.1) arise in different 
applications in practice and we will show some examples later. A recently very popular way to 
solve (1.1) is to apply the alternating direction method of multipliers (ADMM). A typical iteration 
of ADMM for solving (1.1) can be described as 



where A is the Lagrange multiplier associated with the linear constraint Ax + By = b and 7 > 
is a penalty parameter. The ADMM is closely related to some operator splitting methods such as 
Douglas-Rachford operator splitting method [7] and Peaceman-Rachford operator splitting method 
[29] for finding the zero of the sum of two maximal monotone operators. In particular, it was 
shown by Gabay [12] that ADMM (1.2) is equivalent to applying the Douglas-Rachford operator 
splitting method to the dual problem of (1.1). The ADMM and operator splitting methods were 
then studied extensively in the literature and some generalized variants were proposed (see, e.g., 
[20, 10, 14, 8, 9]). The ADMM was revisited recently because it was found very efficient for solving 
many sparse and low-rank optimization problems, such as compressed sensing [38], compressive 
imaging [35, 15], robust PCA [32], sparse inverse covariance selection [41, 30], sparse PCA [22] 
and semidefinite programming [36] etc. Moreover, the iteration complexity of ADMM (1.2) was 
recently established by He and Yuan [17] and Monteiro and Svaiter [24]. The recent survey paper by 
Boyd et al. [3] listed many interesting applications of ADMM in statistical learning and distributed 
optimization. 

Note that the efficiency of ADMM (1.2) actually depends on whether the two subproblems in 
(1.2) can be solved efficiently or not. This requires that the following two problems can be solved 
efficiently for given r > 0, w € M. n and 



[ A fc+1 := X k -j(Ax k+l + By k+1 -b), 
where the augmented Lagrangian function Cy(x,y; A) for (1.1) is defined as 




(1.2) 



y; A) := f(x) + g(y) - (A, Ax + By - b) + j\\Ax + By - bf 



(1.3) 



x := argmin^g^ f(x) + — \\Ax — w 



(1.4) 



2 



and 

V : = argmin^y g(y) + ^\\By - zf. (1.5) 

When X and y are the whole spaces and A and B are identity matrices, (1.4) and (1.5) are known 
as the proximal mappings of functions / and g, respectively. Thus, in this case, ADMM (1.2) 
requires that the proximal mappings of / and g are easy to be obtained. In the cases that A and B 
are not identity matrices, there are results on linearized ADMM (see, e.g., [38, 37, 42]) which try 
to linearize the quadratic penalty term in such a way that problems (1.4) and (1.5) still correspond 
to the proximal mappings of functions / and g. The global convergence of the linearized ADMM 
is guaranteed under certain conditions on a linearization step size parameter. 

There are two interesting problems that are readily solved by ADMM (1.2) since the involved 
functions have easy proximal mappings. One problem is the so-called robust principal component 
pursuit (RPCP) problem: 

min \\X\L +p||Y||i, s.t., X + Y = M, (1.6) 

where p > is a weighting parameter, M G jj mxn i s a given matrix, \\X\\* is the nuclear norm of 
X, which is defined as the sum of singular values of X, and ||Y||i '■= Ylij l^y'l * s the l\ norm of Y . 
Problem (1.6) was studied by Candes et al. [4] and Chandrasekaran et al. [5] as a convex relaxation 
of the robust PC A problem. Note that the two involved functions, the nuclear norm ||X||* and the 
t\ norm ||Y||i, have easy proximal mappings (see, e.g., [23] and [16]). The other problem is the 
so-called sparse inverse covariance selection, which is also known as the graphical lasso problem 
[40, 1, 11]. This problem, which estimates a sparse inverse covariance matrix from sample data, 
can be formulated as 

min -logdet(X) + (S,X) +p||X||i, (1.7) 

where the first convex function — logdet(X) + (E,X) is the negative log-likelihood function for 
given sample covariance matrix S, and the second convex function p||X||x is used to promote the 
sparsity of the resulting solution. Problem (1.7) is of the form of (1.1) because it can be rewritten 
equivalently as 

min -logdet(A) + (S, X) +p||Y||i, s.t., X - Y = 0. (1.8) 

xeR nxn ,v'eR nXn 

Note that the involved function — logdet(A) has an easy proximal mapping (see, e.g., [30] and 
[41]). 

However, there are many problems arising from statistics, machine learning and image processing 
which do not have easy subproblems (1.4) and (1.5) even when A and B are identity matrices. One 
such example is the so-called sparse logistic regression problem. For given training set {ai,bi} 7 ^L 1 
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where a\, a?, • • • , a m are the m samples and b\, . . . ,b m with bi € {—1, * = l,...,m are the 
binary class labels. The likelihood function for these m samples is YYiLi Prob(6j | cij), where 

1 



Prob(6 | a) :- 



1 + exp(— b(a T x + c)) 



is the conditional probability of the label b condition on sample a, where x € M. n is the weight 
vector and c £ K is the intercept, and a T x + c = defines a hyperplane in the feature space, 
on which Prob(6 | a) = 0.5. Besides, Prob(6 | a) > 0.5 if a T x + c has the same sign as b, and 
Prob(6 | a) < 0.5 otherwise. The sparse logistic regression (see [21]) can be formulated as the 
following convex optimization problem 

min £(x, c) + a||x[|i, (1-9) 

x,c 

where £(x, c) denotes the average logistic loss function, which is defined as 

^ m ^ m 

£(x,c) := TTProb(6j | a^) = — y^log(l + exp(— bi(ajx + c))), 

i=l i=l 

and the l\ norm ||x||i is imposed to promote the sparsity of the weight vector x. If one wants to 
apply ADMM (1.2) to solve (1.9), one has to introduce a new variable y and rewrite (1.9) as 

min XiCj3/ £(x,c) + a\\y\\x, 
s.t. x — y = 0. 

When ADMM (1.2) is applied to solve (1.10), although the subproblem with respect to y is easily 
solvable (an l\ shrinkage operation), the subproblem with respect to (x,c) is difficult to solve 
because the proximal mapping of the logistic loss function £(x, c) is not easily computable. 

Another example is the following fused logistic regression problem: 

n 

min £(x, c) + a||x||i + j3 \xj — Xj-i\. (1-H) 

3=2 

This problem cannot be solved by ADMM (1.2), again because of the difficulty of computing the 
proximal mapping of £(x,c). We will discuss this example in more details in Section 5. 

But is it really crucial to compute the proximal mapping exactly in the ADMM scheme? After 
all, ADMM can be viewed as an approximate dual gradient ascent method. As such, computing 
the proximal mapping exactly is in some sense redundant, since on the dual side the iterates are 
updated based on the gradient ascent method. Without sacrificing the scale of approximation to 
optimality, an update on the primal side based on the gradient information (or at least part of it), 
by the principle of primal-dual symmetry, is entirely appropriate. Our subsequent analysis indeed 
confirms this belief. 
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Our contribution. In this paper, we propose a new alternating direction method for solving 
(1.1). This new method requires only one of the functions in the objective to have an easy proximal 
mapping, and the other involved function is merely required to be smooth. Note that the afore- 
mentioned examples, namely sparse logistic regression (1.9) and fused logistic regression (1.11), are 
both of this type. In each iteration, the proposed method involves only computing the proximal 
mapping for one function and computing the gradient for the other function. Under the assumption 
that the smooth function has a Lipschitz continuous gradient, we prove that the proposed method 
finds an e-optimal solution to (1.1) within 0(l/e) iterations. We then compare the performance of 
some variants of the proposed method using basis pursuit problem from compressed sensing. We 
also discuss in details the fused logistic regression problem and show that our method can solve 
this problem effectively. 



2 An Alternating Direction Method Based on Extragradient 



In this section, we consider solving (1.1) where / has an easy proximal mapping, while g is smooth 
but does not have an easy proximal mapping. Note that in this case, ADMM (1.2) cannot be 
applied to solve (1.1) as the solution for the second subproblem in (1.2) is not available. 

We propose the following extragradient-based alternating direction method (EGADM) to solve 
Problem (1.1). Starting with any initial point y° € y and A G R m , a typical iteration of EGADM 
can be described as: 

r fc||2 

j Wh 

(2.1) 



x k+1 


= arg 


mm x G A- C 1 (x,y k ;X k ) + 


yk+1 


= [y k 


--yV v £(x k +\y k ;\ k )]y 


- X k+l 


= x k 


- -/(Ax k+1 + By k - b) 


yk + 1 


= [y k 


-jV y £(x k+ \y k+1 ;\ k+1 )}y 


X k+1 


= x k 


-~/(Ax k+1 + By k+1 -b), 



where [y]y denotes the projection of y onto y, £ 7 (x,y;A) is the augmented Lagrangian function 
for (1.1) as defined in (1.3), H is a pre-specified positive semidefinite matrix, and C(x,y;X) is the 
Lagrangian function for (1.1), which is defined as 



C(x, y; A) := f(x) + g(y) - (A, Ax + By-b). 



(2.2) 



Note that the first subproblem in (2.1) is to minimize the augmented Lagrangian function plus a 
proximal term ^\\x — x k \\'j 1 with respect to x, i.e., 



„fe+i ._ 



argmin^ f{x) - (X k ,Ax + By k - b) + l\\Ax + By" - b\\' 2 + 



x Wh- 



(2.3) 
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In sparse and low-rank optimization problems, the proximal term 5 II s- ^Wh 1S usuan y imposed to 
cancel the effect of matrix A in the quadratic penalty term. One typical choice of H is H = when 
A is identity, and H = tI — ^yA T A when A is not identity, where r > 7A max (j4 T ^4) and A max (^4 T j4) 
denotes the largest eigenvalue of A T A. We assume that (2.3) is relatively easy to solve. Basically, 
when A is an identity matrix, and / is a function arising from sparse optimization such as l\ norm, 
£2 norm, nuclear norm etc, the subproblem (2.3) is usually easy to solve. 

Because we do not impose any structure on the smooth convex function g, the following subproblem 
is not assumed to be easily solvable to optimality: 

min £ 7 (x,y;A) (2.4) 

y^y 

for given x and A. Thus, in the extragradient-based ADM (2.1), we do not solve (2.4). Instead, 
we take gradient projection steps for Lagrangian function £(x, y; A) for fixed x and A to update y. 
Note that since the gradient of C(x, y; A) with respect to A is given by 

V x C(x,y;X) = -(Ax + By-b), (2.5) 

the two updating steps for A in (2.1) can be interpreted as 

A fc+1 : = \ k + 1 V x C{x k+l ,y k - \ k ) (2.6) 



and 



Hence, by defining 




A fc+i . = X k + ^ x c{x k+1 ,y k+1 - X k+1 ). (2.7) 



V y C(x,y\X) 




we can rewrite (2.1) equivalently as 



z k+1 

z k+l 



argmin xe ^/3 7 (a;, y k ; X k ) + ^\\x — 

[z k -~fF(x k+ \z k )] z (2. 
[z k - 7 F{x k+ \z k+l )} z . 



Now it is easy to see that the two steps for updating z in (2.8) are gradient projection steps for 
the Lagrangian function £(x k+1 ,y; A) for (y k , X k ) and (y k+1 , X k+1 ) respectively. The steps for y are 
gradient descent steps and the steps for A are gradient ascent steps, because the original problem 
(1.1) can be rewritten as a minimax problem: 

min max f(x)+g(y) — (X,Ax + By — b), (2-9) 
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which minimizes the Lagrangian function with respect to x and y but maximizes the Lagrangian 
function with respect to A. Since there are two gradient steps for updating z, the second gradient 
step can be seen as an extragradient step. The idea of extragradient is not new. In fact, the extra- 
gradient method as we know it was originally proposed by Korpelevich for variational inequalities 
and for solving saddle-point problems [18, 19]. Korpelevich proved the convergence of the extragra- 
dient method [18, 19]. For recent results on convergence of extragradient type methods, we refer 
to [28] and the references therein. The iteration complexity result for extragradient method was 
analyzed by Nemirovski in [27]. Recently, Monteiro and Svaiter [25, 26, 24] studied the iteration 
complexity results of the hybrid proximal extragradient method proposed by Solodov and Svaiter 
in [31] and its variants. More recently, Bonettini and Ruggiero studied a generalized extragradient 
method for total variation based image restoration problem [2]. 



3 Iteration Complexity 

In this section, we analyze the iteration complexity of EGADM, i.e., (2.1). We show that under the 
assumption that the smooth function g has a Lipschitz continuous gradient, EGADM (2.1) finds 
an e-optimal solution (defined in Definition 3.1) to Problem (1.1) within 0(l/e) iterations. The 
Lagrangian dual problem of (1.1) is 



where 



maxd(X), (3.1) 

A 



d(A) = min £(x,y;X). 



The e-optimal solution to Problem (1.1) is thus defined as follows. 

Definition 3.1 We call (x,y) 6 X x y and A £ M m a pair of e-optimal solution to Problem (1.1) 
if the following holds, 



max ( £(x, y; A) — £(x, y; A) I < e, (3.2) 

xex*,yey*,XeA* V / 

where X* x y* is the optimal set of the primal problem (1.1) and A* is the optimal set of the dual 
problem (3.1). 

The e-optimal solution defined in Definition 3.1 measures the closeness of the optimal solution to 
the optimal set in terms of the duality gap. This is validated by the following saddle point theorem. 



Theorem 3.1 (Saddle-Point Theorem) The pair (x*,y*;X*) with (x*,y*) G X x y and X* £ M ; 
is a primal-dual optimal solution pair to (1.1) if and only if (x*,y*) and X* satisfy 

£(x*,y*;X) < £(x*, y*; A*) <£{x,y;X*), for all x G X,y G y, X € M m , 
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i.e., (x*, y*; A*) is a saddle point of the Lagrangian function C(x, y; A). 



It is well known that the weak duality holds for the primal problem (1.1) and the dual problem 
(3.1), i.e., d(X) < f(x) + g(y) for any feasible solutions (x,y) and A. In our particular case, the 
Lagrangian dual variable A is associated with a set of linear equality constraints. The strong duality 
holds if the primal problem has an optimal solution. Furthermore, we assume that the optimal set 
(X*,y*) to the primal problem (1.1) and the optimal set A* to the dual problem (3.1) are both 
bounded. This assumption indeed holds for a wide variety of problem classes (e.g. when the primal 
objective function is coercive and continuously differentiable) . Now we are ready to analyze the 
iteration complexity of EGADM (2.8), or equivalently, (2.1), for an e-optimal solution in the sense 
of Definition 3.1. We will prove the following lemma first. 

Lemma 3.2 The sequence {x k+l ,z k ,z k } generated by (2.8) satisfies the following inequality: 



( 1 F(x k+1 ,z k+1 ),z k+1 - z k+1 ) - ±\\z k - z k+1 \\ 2 
< 7 2 ||F(rE fc+1 ,z fc + 1 ) - F(x k+1 ,z k )\\ 2 - ±\\z k+1 - z k \\ 2 - ±||z fc+1 - z k+1 " 2 



Proof. Note that the optimality conditions of the two subproblems for z in (2.8) are given by 

(z k -jF{x k+1 ,z k ) -z k+l ,z-z k+l ) < 0, VzeZ, (3.4) 

and 

(z k -jF(x k+ \z k+1 )-z k+ \z-z k+1 ) <0, VztZ. (3.5) 

Letting z = z k+1 in (3.4) and z = z k+l in (3.5), and then summing the two resulting inequalities, 
we get 

|| z *+i _2*+i||2 < 1 (F(x k+1 ,z k )-F(x k+ \z k+1 ),z k+1 -z k+1 ), (3.6) 

which implies 

_ _ fc+ i|| < 7 || j p( a .fc+i )Z fe)_ j p( a; fc+i )f fe+i)||. (3.7) 

Now we are able to prove (3.3). We have, 

{~/F(x k +\z k+1 ),z k+1 -z k+l ) - \\\z k - z k+1 \\ 2 
= j(F{x k+1 , z k+1 ) - F(x k+1 ,z k ), z k+1 - z k+1 ) 
+ 1 {F{x k+l ,z k ),z k+1 - z k+l ) - \\\z k -z k+l \\ 2 

< j(F{x k+1 , z k+1 ) - F(x k+1 ,z k ), z k+1 - z k+1 ) 
-\-(z k — z k+1 z k+1 — z k+1 ) — -\\z k — z k+1 \\ 2 

= 7 (F(x fc+1 , z k+1 ) - F{x k+1 , z k ),z k+1 - z k+1 ) 

-l\\z k \\ 2 + {z k ,z k+1 ) + {z k+1 ,z k+1 - z k+1 ) - \\\z k+1 \\ 2 

< 7 ||F(x fc+1 ,z fc+1 ) -F(x k+1 ,z k )\\ ■ \\z k+1 - z k+1 \\ 
(-l\\z k \\ 2 + {z k ,z k+1 ) - i||z fc+1 || 2 ) + (4||^ +1 || 2 + (z k+1 ,z k+1 ) - l\\z k+1 \\ 2 ) 

< ^11^(^+1,^+1) _ F(x k +\z k )\\ 2 - ±\\z k+1 - z k \\ 2 - i||z fc+1 - z k+1 \\ 2 , 



(3. 
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where the first inequality is obtained by letting z = z + in (3.4) and the last inequality follows 
from (3.7). This completes the proof. □ 



We next prove the following lemma. 



Lemma 3.3 Assume that Vg{y) is Lipschitz continuous with Lipschitz constant L g , i.e., 

\\Vg(yi) - Vg(y 2 )\\ < L g \\ yi - y 2 \\, Vyi,y 2 G y. (3.9) 



By letting 7 < 1/(21,), where L := (max{2L^ + X max (B T B), 2X m£LX (B T B)}) 1 , the following 
equality holds, 

z k+1 \\ 2 <0. 



in- 



(7F(x fc+1 ,z fc+1 ),z fc+1 -z fc+1 )--||z fc 



(3.10) 



Proof. For any z\ 6 Z and z 2 € -Z, we have, 

||F(x fc+1 ,zi) -F(x fc+1 ,z 2 )|| 2 

' (VsfoO - B T \ ± ) - {Vg{y 2 ) - B T X 2 ) ' 
^(Ax k+l + By x - h) - (Ax k+l + By 2 - b) 
(Vg( yi ) - Vg{y 2 )) - B T (X 1 - X 2 )\\ 2 + \\B( yi - y 2 )\\ 2 
< 2\\Vg( yi ) - Vg(y 2 )\\ 2 + 2||5 T (Ai - X 2 )\\ 2 + \\B( yi - y 2 )\\ 2 



< 2L 2 \\ yi - y 2 \\ 2 + 2A max (S T J B)||Ai - A 2 || 2 + X maK {B T B)\\y 1 - y 2 \\ 2 



< max{2L2 + A max (£ T £), 2X max (B T B)} 
= L 2 \\zi-z 2 \\ 2 , 



V\ ~ V2 

Ai - A 2 , 



where the second inequality is due to (3.9) and the last equality is from the definition of L. Thus, 
we know that F(x k+l , z) is Lipschitz continuous with Lipschitz constant L. Since 7 < 1/(2L), we 
have the following inequality, 



j 2 \\F(x k+1 ,z k+1 ) -F\ 



x k+l z fc )|| 2 _ lll^fc+l — z k \\ 2 — -\\z k+l 



Z 



fc+l||2 



k+1 ^k\\\2 _ 1|| 



< ~f 2 \\F{x k+1 ,z k+1 ) - F(x" 

< (l 2 L 

< 



»fc||2 



l)\\z k+1 - z k \\ 2 



which combining with (3.3) yields (3.10). 



□ 



We further prove the following lemma. 



Lemma 3.4 Under the same assumptions as in Lemma 3.3, the following holds: 



1 



z — z 



fc+l||2 



1 



z-z k \\ 2 < {jF(x k+L ,z k+L ),z-z k+i ). 



fc+i ~k+i^ 



(3.11) 
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Proof. Adding 

( 7 F(x k+ \z k+l ), z - z k+l ) - ^\\z k+1 - z k \\ 2 
to both sides of (3.5), we get, 

(z k - z k+ \ z - z k+1 ) - ^\\z k+l -z k \\ 2 < ( 7 F(x fc+1 , z k+l ), z - z k+1 ) - ^\\z k+l - z k \\ 2 . (3.12) 

Notice that the left hand side of (3.12) is equal to ^\\z — z k+1 \\ 2 — ^\\z — z k \\ 2 . Thus we have, 

_1_ 1 1 ^ 1 1 1 2 ^ 1 1 g 1 1 2 



< ( 7 F(x fc+1 , z k+1 ), z - z k+1 ) - \\\z k+1 -z k \\ 2 

= {~fF{x k+ \z k+1 ),z-z k+1 ) + ( 7 F(x k+1 ,z k+1 ),z k+1 -z k+1 ) - \\\z k+1 -z k \ 
fc+i ~fc+l\ „ 



(3.13) 



< (jF{x k+L ,z k+L ),z-z k+L ), 
where the last inequality is due to (3.10). □ 



We now give the (9(l/e) iteration complexity of (2.1) (or equivalently, (2.8)) for an e-optimal 
solution to Problem (1.1). 

Theorem 3.5 Consider Algorithm EG ADM (2.1), and its sequence of iterates. For any integer 
N > 0, define 

N N N 

k=l k=l k=l 

Suppose that the optimal solution sets of the primal problem (1.1) and the dual problem (3.1) are 
both bounded. Moreover, assume that Vg(y) is Lipschitz continuous with Lipschitz constant L g , 
and we choose 7 < 1/(2L) ; where L := (max{2L^ + A max (-B T -B), 2A max (-B T -B)}) 5 . Moreover, we 
choose H := if A is an identity matrix, and H := tI — ^A T A when A is not identity, where 
t > j X ma , K (A T A) . We have the following inequalities: 

max ( C(x N ,y N : X) — C(x, y: X N )] < — — max \\z — z°\\ 2 H — max \\x — x°\\ 2 , 

where Z* := y* x A*. This implies that when N = 0(l/e), {x N ,y N ,X N } is an e-optimal solution 
to Problem (1-1); i.e., the iteration complexity of (2.1) (or equivalently, (2.8)) for an e-optimal 
solution to Problem (1.1) is 0(l/e). 

Proof. The optimality conditions of the subproblem for x in (2.1) are given by 

{df{x k+1 ) - A T X k + 1 A T {Ax k+l + By k -b) + H(x k+1 - x k ), x - x k+1 ) > 0, Vx € X. (3.14) 
By using the updating formula for A fc+1 in (2.1), i.e., 

yk+l ._ X k + 7VA £( x fe+i 5 y k. A fc) =X k_ ^ Ax k+1 + By k _ ^ 
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we obtain, 



(df(x k+1 

Combining (3.11) and (3.15), we have 



A T x k+i + H ( x k+i 



x K ),x-x k+1 ) >0, VxG*. 



(3.15) 



> 



27 



fx — X k+1 \ 

y - y k+1 
\x-~x k+l J 

\z-z k+1 \\ 2 



/df{x k+1 ) -^ T A fc+1 \ 

Vg(y k+1 ) - B T X k+1 
\ Ax k+l + By k+1 -bj 



(3.16) 



y k\\2 



x 



k+l 



), X — x k+1 ). 



Since 

(H(x k -x k+1 ),x-x k+1 ) 



-\\x k -x\\ 2 N +-\\x k -x k+1 \\ 2 H +-\\x-x k+1 \\ 2 H > -\\x- 
2 2" 2 ~ 2 



fc+l||2 



1 



1 1 'y 1 1 2 



summing (3.16) over fc = 0, 1, . . . , N yields, 



"27 AT 



Z — Z 



1 1 2 



2N" X ^ ll/f 

V :c /:(x fc + 1 ,y fe + 1 ,A fc + 1 ) N 
V y C(x k+1 ,y k+1 ,X k+1 ) 



1 1 2 



— N 2^k=l 



X — X 



k+l 



y-y 



k+l 



+w Y,k=i(-VxC(x k+ \y k+ \ \ k+1 ), A - X k+1 ) 



N 

< ^rEk=i(^,y,~x k+1 

< c(x, y , i Eli A fe+1 ) - c$ Eli * k+1 , i Eli y fc+1 ; A) 

= C(x,y;~X N )-C(x N ,y N ;X). 



C(x k +\y k +\ A fc+1 )) + i EfeLi (^(^ +1 , y k+ \ A fc+1 ) - £(z fe+1 ,|/ fe+1 , A)) 
£(x fe+1 ,y fc+1 ,A)) 



Thus we have, 



maXsgA^e^AeA* \C(x N ,y N ; A) - £(x, y; A 7 ^ 
(maxAgA* C(x N , y N ; A) - mm x( zx*,y£y* max Ae A* C{x, y, A)) 
+ (maxAeA* mm x€X * jye y* C(x, y; A) - mm x€X * >yey * C(x, y; X 



N ■ 



< 



v0||2 



X — X 



1 1 2 



2^ ma.x z£Z * \\z - z"\\" + ^ max^** 
Because 3^* and A* are all bounded, iV = 0(l/e) will guarantee that 

max (c(x N ,y N ;X)-C(x,y;X N )) < e, 

and this completes the proof. 



□ 



4 Variants of EGADM 



In this section, we propose some variants of EGADM (2.1). Note that in (2.1), we used the 
augmented Lagrangian function in the subproblem with respect to x, while we used the Lagrangian 
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function in computing the gradient steps for y and y. So one variant of (2.1) we will propose in 
this section is to use the augmented Lagrangian function in computing the gradient steps for y and 
y, and it can be described as follows: 



(4.1) 





= arg 


m[n xex £~y{x,y k ;X k ) + \\\x- 


y k+1 


= [y k 


- 7 V y £ 7 (x fc+1 ,y fe ;A fe )] y 


- X k+i 


= x k 


- 7 (Ac fc+1 + By k - b) 


y k+1 


= [y k 


- 7 V 2/J C 7 (x fe + 1 ,y fc + 1 ;A fc + 1 )] 3 ; 


X k+i 


= x k 


--/(Ax k+1 + By k+1 -b). 



Note that the gradient steps for A and A are unchanged, because the gradients of the Lagrangian 
function and the augmented Lagrangian function with respect to A are the same. By dropping the 
extra gradient steps in (2.1) and (4.1), we get the following two other variants of EGADM: 



and 



x k+i 

y k+1 
X k+i 



x k+1 

yk + 1 
X k+1 



argmin^ C y (x,y k ;X k ) + \ 

[y k 
x k 



pk \\ 2 H 



1 W y C(x k+ \y k ;X k )]y 



, k+ \ 

.k+l I R„,fc+1 



(4.2) 



■y(Ax k+1 + By k 



b). 



argmin^g^ £~ f (x,y k ; X k ) + ^\\x — x 
[y k - 1 V y C J (x k+1 ,y k ;X k )]y 
X k - ~/{Ax k+1 + By k+1 - b). 



k\\2 
H 



(4.3) 



We let "GL" denote the ADM based on gradient for Lagrangian function (4.2), "GAL" denote the 
ADM based on gradient for augmented Lagrangian function (4.3), "EGL" denote the ADM based 
on extra gradient for Lagrangian function (2.1), and "EGAL" denote the ADM based on extra 
gradient for augmented Lagrangian function (4.1). For the performance of these four algorithms, 
we have the following observations based on our numerical results in Section 6. The performance 
of "GL" is the worst among these four algorithms. It usually does not converge or converges very 
slowly. The performance of "EGAL" is the best among these four algorithms. There is no significant 
difference between the performance of "GAL" and "EGL", and they usually perform worse than 
"EGAL" but better than "GL" . Although at this point we are only able to prove the convergence 
and iteration complexity for "EGL", i.e. (2.1), our numerical results suggest that "EGAL", i.e. 
(4.1), should have at least the same convergence properties, which presently remains an unsolved 
problem. 

5 Fused Logistic Regression 



In this section, we show how to use Algorithm (2.1) to solve the fused logistic regression problem, 
which is a convex problem. To introduce our fused logistic regression model, we need to introduce 
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fused lasso problem and logistic regression first. The sparse linear regression problem, known as 
Lasso [33], was introduced to find sparse regression coefficients so that the resulting model is more 
interpretable. The original Lasso model solves the following problem: 

1, 



mm — \\Ax 
2" 



6|| , s.t. < s, 



(5.1) 



where A = [a±, . . . , a m ] T E jjmx™ gives the predictor variables, b = [&i,...,6 m ] T £ ]R m gives 
the responses and the constraint ||sg||i < s is imposed to promote the sparsity of the regression 
coefficients x. The Lasso solution x gives more interpretability to the regression model since the 
sparse solution x ensures that only a few features contribute to the prediction. 

Fused lasso was introduced by Tibshirani et al. in [34] to model the situation that there is certain 
natural ordering in the features. Fused lasso adds a term to impose the sparsity of x in the gradient 
space to model natural ordering in the features. The fused lasso problem can be formulated as 



mm 



1 - 
-\\Ax - b\\ 2 + a\\x\\i +f3^2\ Xj 



Xj-l\ 



(5.2) 



j=2 



Because (5.2) can be transformed equivalently to a quadratic programming problem, Tibshirani 
et al. proposed to solve (5.2) using a two-phase active set algorithm SQOPT of Gill et al. [13]. 
However, transforming (5.2) to a quadratic programming problem will increase the size of the 
problem significantly, thus SQOPT can only solve (5.2) with small or medium sizes. Ye and Xie 
[39] proposed to solve (5.2) using split Bregman algorithm, which can be shown to be equivalent 
to an alternating direction method of multipliers. Note that (5.2) can be rewritten equivalently as 



min \\\Ax - b\\ 2 + a|H|i 

S.t. W = X 

y = Lx, 



(5.3) 



where L is an (n — 1) x n dimensional matrix with all ones in the diagonal and negative ones in the 
super-diagonal and zeros elsewhere. The ADMM (split Bregman) for solving (5.3) can be described 
as 

argminj. £ 7 (x, w k , y k ; A*, A|) 
argmin^ £ 7 (x fc+1 , w, y; , A|) 



(w k+1 ,y k+1 ) 



k+l 
1 

k+l 



X k - -f{w k+1 



-x k+l ) 
Lx k+1 ), 



(5.4) 



where 

£- f (x,w,y;\i,\ 2 ) 



7, 



1 1 



Ax— b\\ +a\\w\\i+(3\\y\\i — (\i,w—x) — (\2,y—Lx) + — \\w—x\\ + — \\y— Lx 



is the augmented Lagrangian function for (5.3), Ai and A2 are Lagrange multipliers associated with 
the two constraints. Note that the subproblem for x in (5.4) requires to minimize a positive-define 
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quadratic function, which can be time consuming, especially when the size of A is large. The 
subproblem for w and y corresponds to the t\ shrinkage operations that can be given in analytical 
form. 

As in [21], the sparse logistic regression problem (1.9) can also be formulated as 

min £(x,c), s.t. ||x[|i < s. (5-5) 



It is now very meaningful to consider the fused logistic regression problem when there is certain 
natural ordering in the features. This leads to the following optimization problem: 



min £(x, c) + allxlji + 8 > \xj—Xj—i\ 

i=2 



Problem (5.6) can be rewritten equivalently as 



(5.6) 



mm xeR rl ,w£M. rl - 1 
S.t. 



a||x||i + 
x = y 
w = Ly. 



kill + %>c) 



(5.7) 



If we apply the ADMM as in (5.4) to solve (5.7), we will end up with the following iterates: 



(x k+1 ,w k+1 ) 

(y k+ \c k+1 ) 

X k+l 

A 2 



argmra^ £ y (x,w,y k ,c k ; \±, \ k ) 
argmin^ Cy(x k+1 , w k+1 , y, c; X k , A§) 



A 2 



Xf - 7 (rr 



(5.J 



where the augmented Lagrangian function Cj(x, w, y; X\, A2) is defined as 

£y(x,w,y,c; Ai,A 2 ) := ^TT=i lo s{^ + ew{-bi{ a J V + c ))) 

+a||x||i + /3\\w\\i - (Xi,x - y) - (X 2 ,w - Ly) + %\\x - y\\ 2 + %\\w - Ly\\ 2 . 

However, note that although the subproblem for (x, w) is still easy, the subproblem for y is no longer 
easy because of the logistic loss function £(y,c). But, since £(y,c) is differentiable with respect 
to (y,c), we can apply our extragradient-based ADM to solve (5.7). Based on the discussions 
in Section 4, we know that "EGAL", i.e., ADM with extragradients for augmented Lagrangian 
function, usually performs the best. We thus only show the details of "EGAL" for solving (5.7) in 
(5.9). One can easily get the corresponding "EGL" for solving (5.7) just by replacing the augmented 
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Lagrangian function by the Lagrangian function when computing the gradients for y steps: 



(x k+1 ,w k+l ) 


:= arg 


! min x,™ C>y(x,w,y k ,c k ;X k ,X%) 




y k+1 


:= y k 


- 1 V y C 1 (x k+ \w k + 1 ,y k ,c k ] X k ,X k ) 




c k+1 


:= c k 


- -fV c £~ / (x k+1 ,w k+1 ,y k , c k ; X k , A§) 




x k+1 


■= A} 


- 7(rr A:+1 — 




xf 


:= X k 


- 7 (u; fc+1 - Ly k+1 ) 




y k+1 


■= y k 


- 7 V ?/ £ 7 (x fc+1 , w k+ \y k+l ,c k+l ; X k+1 


Tfc+l 


c k+i 


:= c k 


- 7 V c £ 7 (x k+1 , w k+1 ,y k+1 , c k+l ; A^' +1 




x k+1 


■= AJ 


— j(x k+1 — y k+1 ) 




\ k + 1 

A 2 


:= X k 


-j(w k+1 -Ly k+1 ). 





(5.9) 



Now we show how to compute the updates in (5.9) explicitly. First, the subproblem for (x, w) can 
be reduced to the following two subproblems: 



x 



k+l ._ 



OL 



argmm x 



1 



klli + 



i 



and 



IV 



k+l 



■ P\\ II 

argmm w ~|M|i + - 
7 2 



w 



7 



Ly k + ^ 

7 



These two subproblems have closed-form solutions given by 

x k+1 : = Shrink(y fc + X k /j, a/7) 

and 

w k+1 := Shrink(L/ + A^/7,/3/7), 
where the i\ shrinkage operator Shrink(z, r) is defined as 

Shrink(z,r) := sign(z) o max{|z| — r, 0}. 



(5.10) 
(5.11) 

(5.12) 
(5.13) 

(5.14) 



To compute the updates for y and c, we need to compute the gradients of £(y, c) with respect to y 
and c. It is easy to check that they can be obtained by: 

1 



111 



-A [ (l-d), V c £(y,c) 



1 



111 



b 1 (1 - d), d= l./(l + exp(-Ay -bo c )), (5.15) 



where A = [hai, b 2 a 2 , b m a m ] T . 

Based on these discussions, we can summarize the extragradient-based ADM for solving (5.7) as 
Algorithm 1. 



15 



Algorithm 1 Extragradient-based ADM for the Fused Logistic Regression 
Initialization: A = [&iai, &2«2j ■ ■ ■ , &md m ] T 
for k = 0, 1, . . . do 

x k+1 ■= Shrink(y fc + A1/7, a/7) 
w k+1 := Shrink(Ly fc + A^/7,/3/7) 

d k := l./(l + exp(-V-6o C fc )), V y e(y k ,c k ) := -ii T (l-d fc ), V c ^ fc ,c fe ) := -±b T (l-d k ) 
yk+i ._ y k _ 7 ( V ^(y fe , c k ) + X\ + L T X 2 + j(y k - x k+l ) + jL T {Ly k - w k+1 )) 
c k+1 := c k — j\7 c l(y k , c k ) 

A^ +1 := X k - j(x k+l - y k ), X k+1 := X 2 - j(w k+1 - Ly k+l ) 
d k+l ■= l./(l + exp{-Ay k+1 -bo c k+l )) 

W y l{y k+ \c k+1 ) := -^i T (l - d k+l ), Vd{y k+ \c k+1 ) := -±b T (l - d k+1 ) 

y k+i ._ y k _ 7 (V/(y fe + 1 , c fe+1 ) + A^ +1 + L T X k+1 + 7 (y fe+1 - a; fc+1 ) + 7 L T (Ly fe+1 - u; fc+1 )) 

c fc+l . = c k _ ^ j^-k+l ^k+l) 

A^ +1 : = X\ - 7 (x fe+1 - y fc+1 ), X k 2 +1 := X k 2 - -f(w k+1 - Ly k+l ) 
end for 



6 Numerical Experiments 

In this section, we test the performance of our extragradient-based ADM for solving basis pursuit 
problem arising from compressed sensing and the fused logistic regression problem (5.6). Our codes 
were written in MATLAB. All numerical experiments were run in MATLAB 7.12.0 on a laptop with 
Intel Core 15 2.5 GHz CPU and 4GB of RAM. 

6.1 Numerical Results for Basis Pursuit 

The cardinality minimization problem arising from compressed sensing seeks the sparsest solution 
of a linear system. It can be formulated as 

min ||x||o, s.t., Ax = b, (6.1) 

where A 6 W nxn (without loss of generality, we assume A has full row rank), b € W 71 and ||x||o 
counts the number of nonzeros of x. The theory of compressed sensing shows that under certain 
conditions, the cardinality minimization problem (6.1) is equivalent to the following i\ minimization 
problem with high probability: 

min IHh, s.t., Ax = b. (6.2) 
Problem (6.2) is also known as the basis pursuit problem [6]. 

In this section, we show the performance of the four algorithms discussed in Section 4, i.e., "GL", 
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"GAL", "EGL" and "EGAL", by solving the basis pursuit problem (6.2). There are many existing 
efficient algorithms for solving (6.2). We are not trying to compare our algorithms with the existing 
methods for solving (6.2). We use (6.2) only to illustrate the performance of the four algorithms 
"GL", "GAL", "EGL" and "EGAL". We show in the following how we use "EGL", i.e., (2.1) to 
solve (6.2). The other three algorithms are similarly implemented. First, we rewrite (6.2) as 



mm^gRn^gRn ||X||l 

s.t. 



x - y = 0, 

y ey, 



(6.3) 



where y := {y \ Ay = b}. The Lagrangian function and the augmented Lagrangian function for 
(6.3) are 

C(x,y;\) := ||x||i - (A, a;- y), 



and 



T 2 

£~y(x, V, A) := ||x||i - (X,x -y) + - \\x - y\\ . 



(6.4) 
(6.5) 



Then the extragradient-based ADM (2.1) for solving (6.3) can be described as 



x k+l 


= arg 




yk. ^k 


y k+l 


= [y k 






- X k + i 


= \ k 


- 7(x fc+1 - 


y k ) 


y k + l 


= [y k 


-J* k+1 ]y 




X k + 1 


= x k 


- 7(x fc+1 - 


y k+1 ) 



(6.6) 



Note that we have chosen H = 0. It is easy to see that solving the subproblem with respect to 
x corresponds to the i\ shrinkage operation. Computing the projection onto y can be done by 
solving a linear system, i.e., 



[w]y :=w + A T (AA 



T\-l, 



Aw). 



The problem instances of (6.2) in our tests were generated in the following manner. For each set of 
parameters (m,n,s), where s denotes the cardinality of the sparse solution x of (6.2), we created 
ten problem instances. The entries of A were drawn randomly according to Gaussian distribution 
jV(0, 1). We then normalized A by setting A := A/\\A\\, where ||j4|| denotes the largest singular 
value of A. To generate x, we first chose the locations of the s nonzero components of x uniformly 
random, and then drew the nonzero values of the s components uniformly random in (0, 1). Finally 
we set b = Ax. We terminated the four algorithms "GL", "GAL", "EGL" and "EGAL" whenever 
\\x k — y k \\2 < 10~ 4 . We also terminated the algorithms whenever the iteration number exceeds 
20000. The comparison results of the four algorithms are reported in Table 1. In Table 1, "Iter" 
denotes the number of iterations, and "T" denotes the CPU time in seconds. 
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From Table 1 we have the following observations. "GL" was the worst one among the four algo- 
rithms. It always achieved the maximum iteration number and the error between x k and x was 
usually high. The other three algorithms can all converge and meet the stopping criterion within 
a reasonable number of iterations. Especially, the performance of "GAL" and "EGL" is similar as 
they required similar number of iterations to meet the stopping criterion and generated solutions 
with similar errors to x. "EGAL" was the best one among the four algorithms. It needed the 
least iteration number to meet the stopping criterion and produced solutions with similar errors 
compared to "GAL" and "EGL". In terms of CPU times, "GAL" and "EGAL" were comparable, 
and "EGL" usually needed more CPU times than "GAL" and "EGAL". Thus, from these tests on 
basis pursuit problems, we see that "EGAL" usually performed better than "EGL". Although our 
current convergence proof and iteration complexity analysis only apply for "EGL" , we believe that 
"EGAL" should have at least the same convergence and complexity results. This will remain as a 
future research topic. 



6.2 Numerical Results for Fused Logistic Regression 



In this section, we report the results of our extragradient-based ADM (Algorithm 1) for solving the 
fused logistic regression problem (5.6). 

First, we used a very simple example to show that when the features have natural ordering, the fused 
logistic regression model (5.6) is much preferable than the sparse logistic regression model (5.5). 
This simple example was created in the following manner. We created the regression coefficient 
x € W 1 for n = 1000 as 

' n, j = 1,2,..., loo 

r 2 , j = 201,202,... ,300 

r 3 , j = 401, 402,..., 500 (6.7) 
r 4 , j = 601,602,... ,700 
0, else, 

where scalers r\, r%, r^, were created randomly uniform in (0, 20). An example plot of x is shown 
in the left part of Figure 1. The entries of matrix A E ^ mxn with m = 500 and n = 1000 were 
drawn from standard normal distribution Af(0, 1). Vector b € W 71 was then created as the signs of 
Ax + ce, where c is a random number in (0, 1) and e is the m-dimensional vector of all ones. We 
then applied our extragradient-based ADM (Algorithm 1) for solving the fused logistic regression 
problem (5.6) and compared the result with the sparse logistic regression problem (5.5). The codes 
for solving (5.5), which is called Lassplore and proposed by Liu et al. in [21], were downloaded from 
http://www.public.asu.edu/~jye02/Software/lassplore/. Default settings of Lassplore were used. 
We chose a = 5 x 10 -4 and /3 = 5 x 10~ 2 in (5.6). The regression result by EGADM (Algorithm 
1) is plotted in Figure 2 (a). We tested different choices of s = 1, 5, 10 in (5.5) and the results are 
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Figure 1: Left: The regression coefficient given in (6.7); Right: The regression coefficient given in 
(6.8). 



plotted in Figure 2 (b), (c) and (d), respectively. From Figure 2 we see that, the fused logistic 
regression model (5.6) can preserve the natural ordering very well. The sparse logistic regression 
model (5.5) gives very sparse solution when s is small, and gives less sparse solution when s is large, 
but none of the choices of s = 1, 5, 10 gives a solution that preserves the natural ordering. 

To further show the capability of our EGADM for solving the fused logistic regression model (5.6), 
especially for large-scale problems, we conducted the following tests. First, we created the regression 
coefficient x € W 1 for n > 100 as 



(6.? 



Note that a similar test example was used in [39] for the fused lasso problem. An example plot of 
x of size n = 500 is shown in the right part of Figure 1. We then created matrix A and vector b 
in the same way mentioned above. We applied our EGADM to solve the fused logistic regression 
model (5.6) with the above mentioned inputs A and b. We report the iteration number, CPU time, 
sparsity of x (denoted by ||x||o) and sparsity of the fused term Lx (denoted by ||Lx||o) in Table 
2. From Table 2 we see that our EGADM can solve the fused logistic regression problem (5.6) 
efficiently. It solved instances with size up to m = 2000, n = 20000 in just a few seconds. 
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Figure 2: (a): The regression result by the fused logistic regression model (5.6); (b), (c), (d): The 
regression result by the sparse logistic regression model (5.5) with s = 1, 5, 10, respectively. 
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7 Conclusion 



In this paper, we proposed a new alternating direction method based on extragradient for solving 
convex minimization problems with the objective function being the sum of two convex functions. 
The proposed method applies to the situation where only one of the involved functions has easy 
proximal mapping, while the other function is only known to be smooth. Under the assumption 
that the smooth function has a Lipschitz continuous gradient, we proved that the proposed method 
finds an e-optimal solution within 0(1/ e) iterations. We used the basis pursuit problem to illustrate 
the performance of the proposed method and its variants. We also proposed a new statistical model, 
namely fused logistic regression, that can preserve the natural ordering of the features in logistic 
regression. Preliminary numerical results showed that this new model is preferable than the sparse 
logistic regression model when there exists natural ordering in the features. The numerical results 
also showed that our extragradient-based ADM can solve large-scale fused logistic regression model 
efficiently. 
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Table 1: Numerical Results for Basis Pursuit Problem 
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Table 2: Numerical Results for Fused Logistic Reg 
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